
[Figure 2.11 is a bar chart of AP on VOC as modules of DETR-R50 are progressively quantized: (1) backbone, (2) encoder, (3) MHA of decoder, (4) MLPs. 4-bit DETR-R50: 83.3 (real-valued), 82.2 with (1), 81.1 with (1)+(2) (−1.1), 79.3 with (1)+(2)+(3) (−1.8), 78.8 with (1)+(2)+(3)+(4) (−0.5). 3-bit DETR-R50: 83.3 (real-valued), 80.1, 79.3 (−0.8), 77.2 (−2.1), 76.8 (−0.4).]

FIGURE 2.11
Performance of 3/4-bit quantized DETR-R50 on VOC with different quantized modules.

$2^{a-1}-1$, $Q^w_n = -2^{b-1}$, and $Q^w_p = 2^{b-1}-1$ are the discrete bounds for $a$-bit activations and $b$-bit weights. $x$ generally denotes the activation in this paper, including the input feature maps of convolution and fully-connected layers and the inputs of multi-head attention modules. Based on this, we first give the quantized fully-connected layer as:

$$\text{Q-FC}(x) = Q_a(x) \cdot Q_w(w) = \alpha_x \alpha_w \Big( x_q \odot w_q + \frac{z}{\alpha_x} \odot w_q \Big), \qquad (2.25)$$

where $\cdot$ denotes matrix multiplication and $\odot$ denotes matrix multiplication with efficient bit-wise operations. The straight-through estimator (STE) [9] is used to approximate the gradient of the non-differentiable rounding during backward propagation.
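For concreteness, the quantizer and the quantized fully-connected layer of Eq. (2.25) can be sketched in PyTorch as follows. This is only an illustrative sketch, not the reference implementation: the names FakeQuantSTE and QFC, the parameters a_bits, w_bits, alpha_x, alpha_w, and z, and their default values are assumptions, and the bit-wise integer kernels used at deployment are merely emulated in floating point.

```python
import torch
import torch.nn.functional as F


class FakeQuantSTE(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator (STE).

    Sketch only: x is quantized to integers in [qn, qp] with step size alpha
    and zero point z, then dequantized; the backward pass sends the gradient
    straight through the round/clip.
    """

    @staticmethod
    def forward(ctx, x, alpha, z, qn, qp):
        x_q = torch.clamp(torch.round((x - z) / alpha), qn, qp)
        return alpha * x_q + z  # fake-quantized (dequantized) value

    @staticmethod
    def backward(ctx, grad_output):
        # Simplest STE: gradient w.r.t. x only; alpha and z are treated as fixed here.
        return grad_output, None, None, None, None


class QFC(torch.nn.Module):
    """Quantized fully-connected layer in the spirit of Eq. (2.25).

    Hypothetical module: a_bits/w_bits set the bit-widths, giving bounds
    Qn = -2**(bits - 1) and Qp = 2**(bits - 1) - 1. The integer product of
    x_q and w_q, which would use bit-wise kernels at inference, is emulated
    with floating-point fake quantization.
    """

    def __init__(self, in_features, out_features, a_bits=4, w_bits=4):
        super().__init__()
        self.weight = torch.nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.alpha_x = torch.nn.Parameter(torch.tensor(0.1))   # activation step size
        self.alpha_w = torch.nn.Parameter(torch.tensor(0.01))  # weight step size
        self.z = torch.nn.Parameter(torch.tensor(0.0))         # activation zero point
        self.a_qn, self.a_qp = -2 ** (a_bits - 1), 2 ** (a_bits - 1) - 1
        self.w_qn, self.w_qp = -2 ** (w_bits - 1), 2 ** (w_bits - 1) - 1

    def forward(self, x):
        x_hat = FakeQuantSTE.apply(x, self.alpha_x, self.z, self.a_qn, self.a_qp)
        w_hat = FakeQuantSTE.apply(self.weight, self.alpha_w, 0.0, self.w_qn, self.w_qp)
        return F.linear(x_hat, w_hat)  # Q_a(x) · Q_w(w)
```

In this sketch the zero point enters exactly as the $z/\alpha_x$ term of Eq. (2.25): $Q_a(x) \approx \alpha_x x_q + z = \alpha_x (x_q + z/\alpha_x)$, so its product with $Q_w(w) = \alpha_w w_q$ expands to $\alpha_x \alpha_w \big( x_q \odot w_q + (z/\alpha_x) \odot w_q \big)$.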

In DETR [31], the visual features generated by the backbone are augmented with position embedding and fed into the transformer encoder. Given an encoder output $E$, DETR performs co-attention between the object queries $O$ and the visual features $E$, which is formulated as:

$$q = \text{Q-FC}(O), \qquad k, v = \text{Q-FC}(E),$$
$$A_i = \operatorname{softmax}\big( Q_a(q)_i \cdot Q_a(k)_i^{\top} / \sqrt{d} \big),$$
$$D_i = Q_a(A)_i \cdot Q_a(v)_i, \qquad (2.26)$$

where $D$ is the output of the multi-head co-attention module, i.e., the co-attended feature for the object query, and $d$ denotes the feature dimension of each head. Additional FC layers transform the decoder's output feature of each object query into the final output. Given the box and class predictions, the Hungarian algorithm [31] is applied between the predictions and the ground-truth box annotations to identify the learning target of each object query.
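The quantized co-attention of Eq. (2.26) can be sketched similarly. The function quantized_co_attention below, its separate Q-FC projections qfc_q, qfc_k, qfc_v, and the activation quantizer q_act standing in for $Q_a$ are illustrative assumptions rather than DETR's or Q-DETR's actual interface.

```python
import math
import torch


def quantized_co_attention(O, E, qfc_q, qfc_k, qfc_v, q_act, num_heads=8):
    """Sketch of the quantized multi-head co-attention of Eq. (2.26).

    O: object queries, shape (num_queries, d_model).
    E: encoder output,  shape (num_tokens,  d_model).
    qfc_q / qfc_k / qfc_v: quantized FC projections (e.g. the QFC sketch above).
    q_act: activation quantizer Q_a applied to q, k, A, and v.
    """
    num_queries, d_model = O.shape
    d = d_model // num_heads  # per-head feature dimension

    # q = Q-FC(O); k, v = Q-FC(E)
    q, k, v = qfc_q(O), qfc_k(E), qfc_v(E)

    def split_heads(x):
        # (seq, d_model) -> (num_heads, seq, d)
        return x.reshape(x.shape[0], num_heads, d).transpose(0, 1)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # A_i = softmax(Q_a(q)_i · Q_a(k)_i^T / sqrt(d))
    A = torch.softmax(q_act(q) @ q_act(k).transpose(-2, -1) / math.sqrt(d), dim=-1)

    # D_i = Q_a(A)_i · Q_a(v)_i  -- the co-attended feature per head
    D = q_act(A) @ q_act(v)

    # Merge heads back to (num_queries, d_model)
    return D.transpose(0, 1).reshape(num_queries, d_model)
```

With the sketches above one could, for example, set q_act = lambda t: FakeQuantSTE.apply(t, torch.tensor(0.05), 0.0, -8, 7) for 4-bit activations, so that q, k, the attention map A, and v are all fake-quantized before the two matrix products, mirroring Eq. (2.26).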

2.4.2 Challenge Analysis

Intuitively, the performance of the quantized DETR baseline largely depends on its information representation capability, which is mainly reflected by the information carried in the multi-head attention modules. Unfortunately, this information is severely degraded by the quantized weights and inputs in the forward pass. In addition, the rounded and discrete quantization significantly affects optimization during backpropagation.

We conduct quantitative ablation experiments by progressively replacing each module of the real-valued DETR baseline with its quantized counterpart and compare the average precision (AP) drop on the VOC dataset [62], as shown in Fig. 2.11. We find that quantizing the MHA